Intermediate NumPy

Unidata Python Workshop


Overview:

  • Teaching: 20 minutes
  • Exercises: 25 minutes

Questions

  1. How do we work with the multiple dimensions in a NumPy Array?
  2. How can we extract irregular subsets of data?
  3. How can we sort an array?

Objectives

  1. Index and slice arrays
  2. Index arrays using true and false
  3. Index arrays using arrays of indices

1. Index and slice arrays

Indexing is how we pull individual data items out of an array. Slicing extends this process to pulling out a regular set of the items.


In [ ]:
# Convention for import to get shortened namespace
import numpy as np

In [ ]:
# Create an array for testing
a = np.arange(12).reshape(3, 4)

In [ ]:
a

Indexing in Python is 0-based, so the command below looks for the 2nd item along the first dimension (row) and the 3rd along the second dimension (column).


In [ ]:
a[1, 2]

Can also just index on one dimension


In [ ]:
a[2]

Negative indices are also allowed, which permit indexing relative to the end of the array.


In [ ]:
a[0, -1]

Slicing syntax is written as start:stop[:step], where all numbers are optional.

  • defaults:
    • start = 0
    • end = len(dim)
    • step = 1
  • The second colon is also optional if no step is used.

It should be noted that end represents one past the last item; one can also think of it as a half open interval: [start, end)


In [ ]:
# Get the 2nd and 3rd rows
a[1:3]

In [ ]:
# All rows and 3rd column
a[:, 2]

In [ ]:
# ... can be used to replace one or more full slices
a[..., 2]

In [ ]:
# Slice every other row
a[::2]

In [ ]:
# You can also slice using negative indices
a[:, :-1]
EXERCISE:
  • The code below calculates a two point average using a Python list and loop. Convert it do obtain the same results using NumPy slicing
  • Bonus points: Can you extend the NumPy version to do a 3 point (running) average?

In [ ]:
data = [1, 3, 5, 7, 9, 11]
out = []

# Look carefully at the loop. Think carefully about the sequence of values
# that data[i] takes--is there some way to get those values as a numpy slice?
# What about for data[i + 1]?
for i in range(len(data) - 1):
    out.append((data[i] + data[i + 1]) / 2)

print(out)

data = np.array([1, 3, 5, 7, 9, 11])
out = (data[:-1] + data[1:]) / 2
print(out)

data = np.array([1, 3, 5, 7, 9, 11])
out = (data[2:] + data[1:-1] + data[:-2]) / 3
print(out)
EXERCISE:
  • Given the array of data below, calculate the total of each of the columns (i.e. add each of the three rows together):

In [ ]:
data = np.arange(12).reshape(3, 4)

# total = ?

print(data[0] + data[1] + data[2])

\# Or we can use numpy's sum and use the "axis" argument
print(np.sum(data, axis=0))

The solution to the last exercise introduces an important concept when working with NumPy: the axis. This indicates the particular dimension along which a function should operate (provided the function does something taking multiple values and converts to a single value).

Let's look at a concrete example with sum:


In [ ]:
a

In [ ]:
# This calculates the total of all values in the array
np.sum(a)

In [ ]:
# Keep this in mind:
a.shape

In [ ]:
# Instead, take the sum across the rows:
np.sum(a, axis=0)

In [ ]:
# Or do the same and take the some across columns:
np.sum(a, axis=1)
EXERCISE:
  • Finish the code below to calculate advection. The trick is to figure out how to do the summation.

In [ ]:
# Synthetic data
temp = np.random.randn(100, 50)
u = np.random.randn(100, 50)
v = np.random.randn(100, 50)

# Calculate the gradient components
gradx, grady = np.gradient(temp)

# Turn into an array of vectors:
# axis 0 is x position
# axis 1 is y position
# axis 2 is the vector components
grad_vec = np.dstack([gradx, grady])
print(grad_vec.shape)

# Turn wind components into vector
wind_vec = np.dstack([u, v])

# Calculate advection, the dot product of wind and the negative of gradient
# DON'T USE NUMPY.DOT (doesn't work). Multiply and add.

advec = (wind_vec * -grad_vec).sum(axis=-1)
print(advec.shape)

Top


2. Indexing Arrays with Boolean Values

Numpy can easily create arrays of boolean values and use those to select certain values to extract from an array


In [ ]:
# Create some synthetic data representing temperature and wind speed data
np.random.seed(19990503)  # Make sure we all have the same data
temp = (20 * np.cos(np.linspace(0, 2 * np.pi, 100)) +
        50 + 2 * np.random.randn(100))
spd = (np.abs(10 * np.sin(np.linspace(0, 2 * np.pi, 100)) +
              10 + 5 * np.random.randn(100)))

In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(temp, 'tab:red')
plt.plot(spd, 'tab:blue');

By doing a comparision between a NumPy array and a value, we get an array of values representing the results of the comparison between each element and the value


In [ ]:
temp > 45

We can take the resulting array and use this to index into the NumPy array and retrieve the values where the result was true


In [ ]:
print(temp[temp > 45])

So long as the size of the boolean array matches the data, the boolean array can come from anywhere


In [ ]:
print(temp[spd > 10])

In [ ]:
# Make a copy so we don't modify the original data
temp2 = temp.copy()

# Replace all places where spd is <10 with NaN (not a number) so matplotlib skips it
temp2[spd < 10] = np.nan
plt.plot(temp2, 'tab:red')

Can also combine multiple boolean arrays using the syntax for bitwise operations. MUST HAVE PARENTHESES due to operator precedence.


In [ ]:
print(temp[(temp < 45) & (spd > 10)])
EXERCISE:
  • Heat index is only defined for temperatures >= 80F and relative humidity values >= 40%. Using the data generated below, use boolean indexing to extract the data where heat index has a valid value.

In [ ]:
# Here's the "data"
np.random.seed(19990503)  # Make sure we all have the same data
temp = (20 * np.cos(np.linspace(0, 2 * np.pi, 100)) +
        80 + 2 * np.random.randn(100))
rh = (np.abs(20 * np.cos(np.linspace(0, 4 * np.pi, 100)) +
              50 + 5 * np.random.randn(100)))


# Create a mask for the two conditions described above
# good_heat_index = 



# Use this mask to grab the temperature and relative humidity values that together
# will give good heat index values
# temp[] ?


# BONUS POINTS: Plot only the data where heat index is defined by
# inverting the mask (using `~mask`) and setting invalid values to np.nan

import numpy as np

\# Here's the "data"
np.random.seed(19990503)  # Make sure we all have the same data
temp = (20 * np.cos(np.linspace(0, 2 * np.pi, 100)) +
        80 + 2 * np.random.randn(100))
rh = (np.abs(20 * np.cos(np.linspace(0, 4 * np.pi, 100)) +
              50 + 5 * np.random.randn(100)))


\# Create a mask for the two conditions described above
good_heat_index = (temp >= 80) & (rh >= 0.4)


\# Use this mask to grab the temperature and relative humidity values that together
\# will give good heat index values
print(temp[good_heat_index]) 

\# BONUS POINTS: Plot only the data where heat index is defined by
\# inverting the mask (using `~mask`) and setting invalid values to np.nan
temp[~good_heat_index] = np.nan
plt.plot(temp, 'tab:red')

Top


3. Indexing using arrays of indices

You can also use a list or array of indices to extract particular values--this is a natural extension of the regular indexing. For instance, just as we can select the first element:


In [ ]:
print(temp[0])

We can also extract the first, fifth, and tenth elements:


In [ ]:
print(temp[[0, 4, 9]])

One of the ways this comes into play is trying to sort numpy arrays using argsort. This function returns the indices of the array that give the items in sorted order. So for our temp "data":


In [ ]:
inds = np.argsort(temp)
print(inds)

We can use this array of indices to pass into temp to get it in sorted order:


In [ ]:
print(temp[inds])

Or we can slice inds to only give the 10 highest temperatures:


In [ ]:
ten_highest = inds[-10:]
print(temp[ten_highest])

There are other numpy arg functions that return indices for operating:


In [ ]:
np.*arg*?

Top